Systematic Biology

Oxford University Press (OUP)

All preprints, ranked by how well they match Systematic Biology's content profile, based on 121 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Advances and applications of the closest-tree algorithm and Hadamard conjugation in phylogenetic inference

Alvarez Gonzalez, E.; Balam-Narvaez, R.; Angulo-Perez, D.; Duchen, P.

2024-12-11 evolutionary biology 10.1101/2024.12.06.627223 medRxiv
Top 0.1%
68.8%

In phylogenetic inference, Hadamard methods and the closest-tree algorithm have been a promising alternative to likelihood-based methods. However, applications to actual biological problems have been limited so far. In the early nineties, Hendy and Penny (1993) developed the two-state closest-tree algorithm for estimating the optimal branch lengths of a phylogenetic tree, whose parameters correspond to Cavender's molecular evolution model (CFN). Steel et al. (1992) then developed the four-state version of this method, whose parameters correspond to Kimura's 3ST molecular evolution model (K3ST). In both cases, formulas for solving the optimization problems were provided. Here, we not only contribute proofs for these formulas, but also adapt this methodology to the orchid genus Lophiarella, whose phylogenetic relationships remain unclear. With this biological application, we show the efficacy of the closest-tree algorithm coupled with Hadamard conjugation, phylogenetic invariants and edge-parameter inequalities (in Fourier coordinates) in jointly inferring the tree topology and the molecular evolution model that best explains the data. Finally, we reconcile this phylogeny with biogeographical and morphological aspects within this genus.

2
Putting the F in FBD analyses: tree constraints or morphological data?

Barido-Sottani, J.; Pohle, A.; De Baets, K.; Murdock, D.; Warnock, R.

2022-07-18 evolutionary biology 10.1101/2022.07.07.499091 medRxiv
Top 0.1%
68.5%

The fossilized birth-death (FBD) process provides an ideal model for inferring phylogenies from both extant and fossil taxa. Using this approach, fossils (with or without character data) are directly considered as part of the tree. This leads to a statistically coherent prior on divergence times, where the variance associated with node ages reflects uncertainty in the placement of fossil taxa in the phylogeny. Since fossils are typically not associated with molecular sequences, additional information is required to place fossils in the tree. Previously, this information has been provided in two different forms: using topological constraints, where the user specifies monophyletic clades based on established taxonomy, or so-called total-evidence analyses, which use a morphological data matrix with data for both fossil and extant specimens in addition to the molecular alignment. In this work, we use simulations to evaluate these different approaches to handling fossil placement in FBD analyses, both in ideal conditions and in datasets including uncertainty or even errors. We also explore how rate variation in fossil recovery or diversification rates impacts these approaches. We find that the extant topology is well recovered under all methods of fossil placement. Divergence times are similarly well recovered across all methods, with the exception of constraints which contain errors. These results are consistent with expectations: in FBD inferences, divergence times are mostly informed by fossil ages, so variations in the position of fossils strongly impact these estimates. On the other hand, the placement of extant taxa in the phylogeny is driven primarily by the molecular alignment.
We see similar patterns in datasets which include rate variation; however, one notable difference is that relative errors in extant divergence times increase when more variation is included in the dataset, for all approaches using topological constraints, and particularly for constraints with errors. Finally, we show that trees recovered under the FBD model are more accurate than those estimated using non-FBD (i.e., non-time calibrated) inference. This result holds even with the use of erroneous fossil constraints and model misspecification under the FBD. Overall, our results underscore the importance of core taxonomic research, including morphological data collection and species descriptions, irrespective of the approach to handling phylogenetic uncertainty using the FBD process.

3
Does time matter in phylogeny? A perspective from the fossil record

Guenser, P.; Warnock, R. C. M.; Donoghue, P. C. J.; Jarochowska, E.

2021-06-11 paleontology 10.1101/2021.06.11.445746 medRxiv
Top 0.1%
66.3%

The role of time (i.e. taxa ages) in phylogeny has been a source of intense debate within palaeontology for decades and has not yet been resolved fully. The fossilised birth-death range process is a model that explicitly accounts for information about species through time. It presents a fresh opportunity to examine the role of stratigraphic data in phylogenetic inference of fossil taxa. Here, we apply this model in a Bayesian framework to an exemplar dataset of well-dated conodonts from the Late Devonian. We compare the results to those obtained using traditional unconstrained tree inference. We show that the combined analysis of morphology and stratigraphic data under the FBD range process reduces overall phylogenetic uncertainty, compared to unconstrained tree inference. We find that previous phylogenetic hypotheses based on parsimony and stratophenetics are closer to trees generated under the FBD range process. However, the results also highlight that irrespective of the inclusion of age data, a large amount of topological uncertainty will remain. Bayesian inference provides the most intuitive way to represent the uncertainty inherent in fossil datasets and new flexible models increase opportunities to refine hypotheses in palaeobiology.

4
The Implications of Over-Estimating Gene Tree Discordance on a Rapid-Radiation Species Tree (Blattodea: Blaberidae)

Evangelista, D. A.; Gilchrist, M. A.; Legendre, F.; O'Meara, B.

2019-07-28 evolutionary biology 10.1101/717660 medRxiv
Top 0.1%
62.0%

Patterns of discordance between gene trees and the species trees they reside in are crucial to the debate over the superiority of coalescent or concatenation approaches to tree inference. However, errors in estimating gene tree topologies obfuscate the issue by making gene trees appear erroneously discordant with the species tree. We thus test the prevalence of discordance between gene trees and their species tree using an empirical dataset for a clade with a rapid radiation (Blaberidae). We find that one model of codon evolution (FMutSel0) prefers gene trees that are less discordant, while another (SelAC) shows no such preference. We compare the species trees resulting from the selected sets of gene trees on the basis of internal consistency, predictive ability, and congruence with independent data. The species tree resulting from the gene trees chosen by FMutSel0, a set with low discordance, is the most robust and biologically plausible. Thus, we conclude that the results from FMutSel0 are better supported: simple models (i.e., GTR and ECM) infer trees with erroneously high levels of gene tree discordance. Furthermore, the amount of discordance in the set of gene trees has a large effect on the downstream phylogeny. Thus, decreasing gene tree error by lessening erroneous discordance can result in higher quality species trees. These results allow us to support relationships among blaberid cockroaches that were previously in flux as they now demonstrate molecular and morphological congruence.

5
State Space Misspecification in Morphological Phylogenetics: A Pitfall for Models and Parsimony Alike

Huang, E.

2025-04-26 evolutionary biology 10.1101/2025.04.22.650124 medRxiv
Top 0.1%
53.5%

Phylogenetic analysis relies on two fundamental levels of biological information: genotype and phenotype. Molecular data benefit from operating within a well-defined, finite state space (e.g., nucleotide alphabets), whereas morphological data present inherent challenges due to frequently ambiguous character states and variable state counts. In this study, I use simulated data to examine how state space misspecification (SSM), defined as the mismatch between the assumed and true state space, affects phylogenetic reconstruction. Results show that SSM generally reduces topological accuracy, with the extent of its impact depending on mutation rate, state space disparity, and the proportion of affected characters. Counterintuitively, under conditions typical of empirical morphological datasets (high proportions of binary characters and elevated mutation rates), SSM can improve topological precision. This creates a paradox where an incorrect model outperforms a correct one, though at the cost of distorted branch lengths. Importantly, the effects of SSM extend beyond model-based approaches. I demonstrate, through an extension of the no common mechanism (NCM) model, that standard maximum parsimony is consistent with the assumption that characters evolved under an SSM model--a largely overlooked feature. To address this, I propose a state-space-aware weighting scheme that accounts for variation in character state space. I also discuss additional strategies for mitigating SSM, including model adjustments and reducing reliance on oversimplified binary coding. This work underscores the need to explicitly address state space uncertainty in morphological phylogenetics. As morphology remains crucial for reconstructing deep-time lineages and integrating fossils, accounting for SSM is essential to improving the reliability of evolutionary trees.

6
A Time-calibrated Firefly (Coleoptera: Lampyridae) Phylogeny: Using Genomic Data for Divergence Time Estimation

Hoehna, S.; Lower, S. E.; Duchen, P.; Catalan, A.

2022-02-01 evolutionary biology 10.1101/2021.11.19.469195 medRxiv
Top 0.1%
51.5%

Fireflies (Coleoptera: Lampyridae) consist of over 2,000 described extant species. A well-resolved phylogeny of fireflies is important for the study of their population genetics, bioluminescence, evolution, and conservation. We used a recently published anchored hybrid enrichment dataset (AHE; 436 loci for 88 Lampyridae species and 10 outgroup species) and state-of-the-art statistical methods (the fossilized birth-death-range process implemented in a Bayesian framework) to estimate a time-calibrated phylogeny of Lampyridae. Unfortunately, estimating calibrated phylogenies using AHE and the latest and most robust time-calibration strategies is not possible because of computational constraints. As a solution, we subset the full dataset by applying three different strategies: (i) using the most complete loci, (ii) using the most homogeneous loci, and (iii) using the loci with the highest accuracy to infer the well-established Photinus clade. The estimated topology using the three data subsets agreed on almost all major clades and only showed minor discordance within less supported nodes. The estimated divergence times overlapped for all nodes that are shared between the topologies. Thus, divergence time estimation is robust as long as the topology inference is robust and any well selected data subset suffices. Additionally, we observed an unexpected amount of gene tree discordance between the 436 AHE loci. Our assessment of model adequacy showed that standard phylogenetic substitution models are not adequate for any of the 436 AHE loci, which is likely to bias phylogenetic inferences. We performed a simulation study to explore the impact of (a) incomplete lineage sorting, (b) uniformly distributed and systematic missing data, and (c) systematic bias in the position of highly variable and conserved sites.
For our simulated data, we observed less gene tree variation which shows that the empirically observed amount of gene tree discordance for the AHE dataset is unexpected and needs further investigation.

7
Automatic Discovery of Optimal Discrete Character Models

Boyko, J.

2025-12-03 evolutionary biology 10.1101/2024.11.15.623760 medRxiv
Top 0.1%
44.2%

Modeling discrete character evolution in a Markovian framework has become common practice in phylogenetic comparative methods. The increasing size and complexity of these models reflects a trend of analyses to include more taxa and more discrete characters. However, as the complexity of the models increases, so do the number of potential model structures and the number of estimable parameters, making it nearly impossible to consider all modeling options for a given dataset. To overcome this issue, I apply a combination of regularization and simulated annealing to models of discrete character evolution. This allows for the automatic searching and optimization across different model structures without user specification. I test this framework under several simulation scenarios including hidden rates and multiple discrete characters. The results indicate that regularized models significantly outperform traditional approaches, yielding a far lower variance and a nearly tenfold reduction in the overall error of parameter estimates in the most extreme scenarios. I illustrate the power of automatic model selection by revisiting the ancestral state estimation of concealed ovulation and mating systems in Old World monkeys. Using the dredge algorithm, I discover a previously unexamined model structure which has both better statistical performance and a differing ancestral state reconstruction when compared to default model sets. In general, these results highlight the dangers of an over-reliance on default model sets. The combination of automatic model selection and regularization helps overcome problems of over-parameterization, and these results demonstrate that when inferences are drawn from a larger model space, they can be both more statistically robust and biologically realistic.

8
A discrete character evolution model for phylogenetic comparative biology with Γ-distributed rate heterogeneity among branches of the tree

Revell, L. J.; Harmon, L. J.

2024-05-30 evolutionary biology 10.1101/2024.05.25.595896 medRxiv
Top 0.1%
42.1%

Phylogenetic comparative methods are now widely used to measure trait evolution on the tree of life. Often these methods involve fitting an explicit model of character evolution to trait data and then comparing the explanatory power of this model to alternative scenarios. In this article, we present a new model for discrete trait evolution in which the rate of character change in the tree varies from edge (i.e., "branch") to edge of the phylogeny according to a discretized Γ distribution. When the edge-wise rates of evolution are, in fact, Γ-distributed, we show via simulation that this model can be used to reliably estimate the shape parameter (α) of the distribution of rate variation among edges. We also describe how our model can be employed in ancestral state reconstruction, and demonstrate via simulation how doing so will tend to increase the accuracy of our estimated states when the generating edge rates are Γ-distributed. We discuss how marginal edge rates are estimated under the model, and apply our method to a real dataset of digit number in squamate reptiles, modified from Brandley et al. (2008).

9
Diagnosability to inform species delimitation for the genus Emydura (Testudines: Chelidae) from northern Australia

Georges, A.; Unmack, P. J.; Kilian, A.; Zhang, X.; Amepou, Y.; Dissanayake, D. S. B.

2025-07-11 genetics 10.1101/2025.07.10.664252 medRxiv
Top 0.1%
41.7%

Understanding the evolutionary history of diversifying lineages and the delineation of species remain major challenges for evolutionary biology. Here we use single nucleotide polymorphisms (SNPs) and sequence fragment presence-absence (SilicoDArT) data to combine phylogenetics and population genetics to assess species boundaries with a focus on diagnosability. We challenge current and proposed taxonomies in a genus of Australian freshwater turtles (Chelidae: Emydura) from northern Australia and southern New Guinea. In a six-step process, we combine phylogeny with the concept of diagnosability based on fixed allelic differences to select diagnosable lineages as candidate species. Four taxa are supported as diagnosable lineages, two of which we elevate to species status. The nuclear and mitochondrial phylogenies differed in important respects, which we attribute to recent or contemporary lateral transfer of mitochondria during hybridization events, deeper historical hybridization or possibly incomplete lineage sorting of the mitochondrial genome. Taxonomic decisions in cases of allopatry require subjective judgement. Our six-step strategy and the necessary (but not sufficient) criterion of diagnosability add an additional level of objectivity before that subjectivity is applied, and so reduce the risk of taxonomic inflation that can accompany lineage approaches to species delimitation.

10
Supporting per-locus substitution rates improves the accuracy of species networks and avoids spurious reticulations

Cao, Z.; Ogilvie, H.; Nakhleh, L.

2022-01-18 evolutionary biology 10.1101/2022.01.16.476511 medRxiv
Top 0.1%
41.5%

The development of statistical methods to infer species phylogenies with reticulation (species networks) has led to many discoveries of gene flow between distinct species. However, because the dimensionality of species networks is not fixed, these methods may compensate for kinds of model misspecification, such as assuming a single substitution rate for all genomic loci, by increasing the number of dimensions beyond the true value. The popular full Bayesian species network method MCMC_SEQ has previously made this assumption, so we have added support for the proven Dirichlet model for per-locus rates to enhance its accuracy and avoid spurious results. We studied the effects of this model using simulation and an empirical dataset from Heliconius butterflies. We found that assuming a single substitution rate applies to all loci leads to the inference of spurious reticulation in simulated and empirical datasets when a full Bayesian method is used; however, the summary method InferNetwork_ML is robust to per-locus variation in substitution rates when set to ignore gene tree branch lengths. Our implementation of the model resolves this misspecification and successfully converges to the true species networks. It also infers far more accurate gene trees than assuming a single rate, or independent inference of gene trees. Our implementation of the Dirichlet per-locus rates model is now available in PhyloNet, a software package for phylogenetic inference, open source on GitHub: https://github.com/NakhlehLab/PhyloNet.

11
How much information is there for inferring species trees?

Milkey, A.; Chen, J.; Lewis, P. O.

2026-04-02 evolutionary biology 10.64898/2026.04.01.715836 medRxiv
Top 0.1%
41.2%

As modern phylogenomics datasets become increasingly large, it is useful to develop recommendations for how to subsample datasets for best species tree inference. Here we apply a new measure of phylogenetic information content that estimates the reduction in tree space occupied by a posterior sample of inferred trees relative to a prior sample in order to assess the effects of gene tree parameters on species tree estimation. We find that, consistent with earlier studies, when data are informative, more data result in better species tree inference. However, when data are uninformative, subsampling a dataset to include only the most informative loci may produce a better species tree sample. We perform analyses on a variety of simulated and empirical datasets.

12
Impacts of Taxon-Sampling Schemes on Bayesian Molecular Dating under the Unresolved Fossilized Birth-Death Process

Luo, A.; Zhang, C.; Zhou, Q.-S.; Ho, S. Y. W.; Zhu, C.-D.

2021-11-19 evolutionary biology 10.1101/2021.11.16.468757 medRxiv
Top 0.1%
40.5%

Evolutionary timescales can be estimated using a combination of genetic data and fossil evidence based on the molecular clock. Bayesian phylogenetic methods such as tip dating and total-evidence dating provide a powerful framework for inferring evolutionary timescales, but the most widely used priors for tree topologies and node times often assume that present-day taxa have been sampled randomly or exhaustively. In practice, taxon sampling is often carried out so as to include representatives of major lineages, such as orders or families. We examined the impacts of these diversified sampling schemes on Bayesian molecular dating under the unresolved fossilized birth-death (FBD) process, in which fossil taxa are topologically constrained but their exact placements are not inferred. We used synthetic data generated by simulation of nucleotide sequence evolution, fossil occurrences, and diversified taxon sampling. Our analyses show that increasing sampling density does not substantially improve divergence-time estimates under benign conditions. However, when the tree topologies were fixed to those used for simulation or when evolutionary rates varied among lineages, the performance of Bayesian tip dating improves with sampling density. By exploring three situations of model mismatches, we find that including all relevant fossils without pruning off those inappropriate for the FBD process can lead to underestimation of divergence times. Our reanalysis of a eutherian mammal data set confirms some of the findings from our simulation study, and reveals the complexity of diversified taxon sampling in phylogenomic data sets. In highlighting the interplay of taxon-sampling density and other factors, the results of our study have useful implications for Bayesian molecular dating in the era of phylogenomics.

13
Enhancing Evolutionary Timelines: The Impact of Stratigraphic Range Information on Phylogenetic Inference

Stolz, U.; Gavryushkina, A.; Vaughan, T. G.; Stadler, T.; Allen, B. J.

2025-04-22 evolutionary biology 10.1101/2025.04.17.649084 medRxiv
Top 0.1%
40.0%

Coherent phylogenetic analyses of molecular and fossil datasets have deepened our understanding of evolutionary biology. A core model facilitating such coherent analyses is the fossilised birth-death (FBD) model which directly incorporates fossils within phylogenetic trees. However, a limitation of the FBD model is that it cannot assign multiple fossil occurrences of the same species, limiting our ability to accurately represent age information for species which have been sampled repeatedly. To address this gap, the Stratigraphic Ranges Fossilized Birth-Death (SRFBD) model has recently been introduced. This model can account for sampled stratigraphic ranges and integrate over occurrences within the range, enabling us to include more complete fossil age information within the inference. Here, building upon this mathematical work, we develop a computational method making the model accessible to the community, for more accurate total-evidence inference of dated phylogenetic trees and evolutionary parameters. In particular, we integrate the SRFBD model into BEAST2, facilitating the use of a diverse array of genomic substitution models and the inclusion of both morphological and molecular data when inferring phylogenies. We present a thorough validation of the SRFBD implementation against simulated data. We then demonstrate the differences in posterior parameter estimates and phylogenies when applying FBD and SRFBD models to two example datasets, the Spheniscidae (penguins) and the Canidae (dogs). In both examples, the SRFBD model produces older divergence times, lower diversification and turnover rates, and considerably higher sampling proportions compared to the FBD model.

14
A New Information Theoretic Approach Shows that Mixture Models Outperform Partitioned Models for Phylogenetic Analyses of Amino Acid Data

Ren, H.; Jiang, C.; Wong, T. K. F.; Shao, Y.; Susko, E.; Minh, B. Q.; Lanfear, R.

2026-03-18 evolutionary biology 10.64898/2026.03.16.712229 medRxiv
Top 0.1%
39.4%

Partitioned and mixture models are widely employed in Maximum Likelihood phylogenetic analyses of large genomic datasets. Comparing the fit of the two types of models has been challenging, because standard information-theoretic approaches cannot be applied. Mixture models are increasingly popular for the analysis of amino acid datasets and can lead to different conclusions compared to partitioned models. This raises an important question - which type of model tends to perform better? Susko et al. (2026) recently introduced the marginal Akaike information criterion (mAIC), which allows mixture models and partitioned models to be directly compared for the first time. Here, we use the mAIC and a range of other approaches to compare the fit of mixture and partitioned models across a diverse set of empirical datasets. We show that mixture models are universally favoured on amino acid datasets. This has important implications for interpreting empirical analyses and suggests that continued development of mixture models is an important avenue for future research.

15
Evaluating the impact and detectability of mass extinctions on total-evidence dating

Du, M.; Wang, W.; Tan, J.; Barido-Sottani, J.

2025-09-30 evolutionary biology Community evaluation 10.1101/2025.09.28.679059 medRxiv
Top 0.1%
39.1%

Fossils are crucial for accurately dating phylogenetic trees because their ages provide vital constraints on the timing of macroevolutionary events, and their morphological characters offer key information on evolutionary rates and phylogenetic positions. The fossilized birth-death (FBD) process is a diversification model that incorporates both extant and extinct species, serving as a tree prior that seamlessly integrates fossils into phylogenetic inference. While the FBD model can account for mass extinctions, which caused rapid, widespread organismal loss, few studies have utilized FBD models incorporating these events in phylogenetic inference. This is likely because the detectability of mass extinctions and their impact on phylogenetic inference remain unclear. Through simulations, we assessed the influence of mass extinctions on divergence time and topology inference and evaluated the detectability of mass extinction signals in total-evidence dating. We examined three FBD tree priors: without mass extinction, with known mass extinction time and survival probability, and with known mass extinction time but unknown survival probability. Our results show that the FBD model with known mass extinction time and unknown survival probability was able to reliably detect mass extinctions when they occurred, and correctly refrained from detecting mass extinctions when they were absent. Moreover, different FBD models generate similar divergence time and tree topology errors. Even when the FBD model used for tree inference did not explicitly account for mass extinction events, signals of mass extinction were still detectable on the resulting MCC trees. The accuracy of the detection was similar to the one obtained from MCC trees inferred using an FBD model that includes mass extinction parameters. We also reduced the fossilization rate and the number of morphological characters, obtaining results consistent with the aforementioned findings.
However, reducing the fossilization rate decreased the accuracy of detecting mass extinctions when they occurred, and reducing the number of morphological characters decreased the accuracy of divergence time inference. Furthermore, we adjusted the priors for the existence of mass extinction and the survival probability of mass extinction. We found that the prior for the existence of mass extinction had no effect on inference, whereas the prior for the survival probability of mass extinction significantly influenced both the detection of mass extinctions and the estimation of survival probabilities. Finally, we applied these models to empirical datasets of tetraodontiform fishes and crinoids and found that, consistent with our simulation results, the inclusion of a mass extinction event in the tree prior had a negligible impact on the inferred topologies and divergence times.

16
Terraces in Gene Tree Reconciliation-Based Species Tree Inference

Sanderson, M.; McMahon, M. M.; Steel, M.

2020-04-18 evolutionary biology 10.1101/2020.04.17.047092 medRxiv
Top 0.1%
39.0%

Terraces in phylogenetic tree space are sets of trees with identical optimality scores for a given data set, arising from missing data. These were first described for multilocus phylogenetic data sets in the context of maximum parsimony inference and maximum likelihood inference under certain model assumptions. Here we show how the mathematical properties that lead to terraces extend to gene tree - species tree problems in which the gene trees are incomplete. Inference of species trees from either sets of gene family trees subject to duplication and loss, or allele trees subject to incomplete lineage sorting, can exhibit terraces in their solution space. First, we show conditions that lead to a new kind of terrace, which stems from subtree operations that appear in reconciliation problems for incomplete trees. Then we characterize when terraces of both types can occur when the optimality criterion for tree search is based on duplication, loss or deep coalescence scores. Finally, we examine the impact of assumptions about the causes of losses: whether they are due to imperfect sampling or true evolutionary deletion.

17
Macroevolutionary analysis of discrete character evolution using parsimony-informed likelihood

Grundler, M.; Rabosky, D. L.

2020-01-08 evolutionary biology 10.1101/2020.01.07.897603 medRxiv
Top 0.1%
37.6%

Rates of character evolution in macroevolutionary datasets are typically estimated by maximizing the likelihood function of a continuous-time Markov chain (CTMC) model of character evolution over all possible histories of character state change, a technique known as maximum average likelihood. An alternative approach is to estimate ancestral character states independently of rates using parsimony and to then condition likelihood-based estimates of transition rates on the resulting ancestor-descendant reconstructions. We use maximum parsimony reconstructions of possible pathways of evolution to implement this alternative approach for single-character datasets simulated on empirical phylogenies using a two-state CTMC. We find that transition rates estimated using parsimonious ancestor-descendant reconstructions have lower mean squared error than transition rates estimated by maximum average likelihood. Although we use a binary state character for exposition, the approach remains valid for an arbitrary number of states. Finally, we show how this method can be used to rapidly and easily detect phylogenetic variation in tempo and mode of character evolution with two empirical examples from squamates. These results highlight the mutually informative roles of parsimony and likelihood when testing hypotheses of character evolution in macroevolution.

18
Ancestral state reconstruction with discrete characters using deep learning

Nagel, A. A.; Landis, M. J.

2026-03-21 evolutionary biology 10.64898/2026.03.19.712918 medRxiv
Top 0.1%
37.6%

Ancestral state reconstruction is a classical problem of broad relevance in phylogenetics. Likelihood-based methods for reconstructing ancestral states under discrete character models, such as Markov models, have proven extremely useful, but only work so long as the assumed model yields a tractable likelihood function. Unfortunately, extending a simple but tractable phylogenetic model to possess new, but biologically realistic, properties often results in an intractable likelihood, preventing its use in standard modeling tasks, including ancestral state reconstruction. The rapid advancement of deep learning offers a potential alternative to likelihood-based inference of ancestral states, particularly for models with intractable likelihoods. In this study, we modify the phylogenetic deep learning software phyddle to conduct ancestral state reconstruction. We evaluate phyddle's performance under various methodological and modeling conditions, while comparing to Bayesian inference when possible. For simple models and small trees, its performance resembles the performance of Bayesian inference, but worsens as tree size increases. While phyddle still performs adequately for more complex models, such as speciation and extinction models, the estimates differ more from Bayesian inference in comparison with simpler models. Lastly, we use phyddle to infer ancestral states for two empirical datasets: the ancestral ranges of a subclade of the genus Liolaemus, and the ancestral locations of sequences from the 2014 Sierra Leone Ebola virus disease outbreak.

19
Distinguishing between histories of speciation and introgression using genomic data

Hibbins, M. S.; Hahn, M. W.

2022-09-09 evolutionary biology 10.1101/2022.09.07.506990 medRxiv
Top 0.1%
33.2%

Introgression creates complex, non-bifurcating relationships among species. At individual loci and across the genome, introgression and incomplete lineage sorting interact to produce a wide range of different gene tree topologies. These processes can obscure the history of speciation among lineages, and, as a result, identifying the history of speciation vs. introgression remains a challenge. Here, we use theory and simulation to investigate how introgression can mislead multiple approaches to species tree inference. We find that arbitrarily low amounts of introgression can mislead both gene tree methods and parsimony methods if the rate of incomplete lineage sorting is sufficiently high. We also show that an alternative approach based on minimum gene tree node heights is inconsistent and depends on the rate of introgression across the genome. To distinguish between speciation and introgression, we apply supervised machine learning models to a set of features that can easily be obtained from phylogenomic datasets. We find that several of these models are highly accurate in classifying the species history in simulated datasets. We also show that, if the histories of speciation and introgression can be identified, PhyloNet will return highly accurate estimates of the contribution of each history to the data (i.e. edge weights). Overall, our results highlight the promise of supervised machine learning as a potentially powerful complement to phylogenetic methods in the analysis of introgression from genomic data.

20
Species Tree Inference under the Multispecies Coalescent on Data with Paralogs is Accurate

Yan, Z.; Du, P.; Hahn, M. W.; Nakhleh, L.

2020-03-24 evolutionary biology 10.1101/498378 medRxiv
Top 0.1%
33.1%

Many recent phylogenetic methods have focused on accurately inferring species trees when there is gene tree discordance due to incomplete lineage sorting (ILS). For almost all of these methods, and for phylogenetic methods in general, the data for each locus is assumed to consist of orthologous, single-copy sequences. Loci that are present in more than a single copy in any of the studied genomes are excluded from the data. These steps greatly reduce the number of loci available for analysis. The question we seek to answer in this study is: What happens if one runs such species tree inference methods on data where paralogy is present, in addition to or without ILS being present? Through simulation studies and analyses of two large biological data sets, we show that running such methods on data with paralogs can still provide accurate results. We use multiple different methods, some of which are based directly on the multispecies coalescent (MSC) model, and some of which have been proven to be statistically consistent under it. We also treat the paralogous loci in multiple ways: from explicitly denoting them as paralogs, to randomly selecting one copy per species. In all cases the inferred species trees are as accurate as equivalent analyses using single-copy orthologs. Our results have significant implications for the use of ILS-aware phylogenomic analyses, demonstrating that they do not have to be restricted to single-copy loci. This will greatly increase the amount of data that can be used for phylogenetic inference.